IBM Applied Data Science Capstone Project

Segmenting and Clustering Neighborhoods in London

In this notebook, neighborhoods in the city of London are explored, segmented, and clustered. A Wikipedia page lists the areas of London with all the information needed for this analysis. The data is scraped from that page, wrangled and cleaned, and then read into a pandas dataframe so that it is in a structured format.

Once the data is in a structured format, we analyze it to decide where to recommend opening a Japanese restaurant.

Our goal is to segment and cluster the neighborhoods of London and to offer recommendations to anyone looking to open a Japanese restaurant in the city.

Introduction

Business Problem

The aim is to help restaurant chain owners and investors who are looking to open or invest in a Japanese restaurant in the city of London.

To solve this problem, we need to complete the following steps with London data:

List of Neighborhoods in London

This London data is extracted from the Wikipedia page titled, ‘List of areas of London’ (https://en.wikipedia.org/wiki/List_of_areas_of_London). Using the BeautifulSoup and Requests packages of Python, the required data is scraped from the webpage.

Latitude and Longitude Coordinates of London

We fetch the location data of all these neighborhoods with the Python Geocoder package, because the Foursquare API needs coordinates as input. Next, the Foursquare API is used to get the venues of each neighborhood.

Clustering

To prepare the data for K-means clustering, we group the data frame by neighborhood. Lastly, K-means clustering is performed on this data set to return 4 clusters, or categories of neighborhoods, based on the share of venues that are Japanese restaurants.

This project encompasses a series of data science techniques, including web scraping (using BeautifulSoup and Requests), data cleaning, data wrangling, and machine learning (K-means clustering).

Methodology

Install the required packages.

In [2]:
#!pip install arcgis
#!pip install wikipedia
#!conda install -c conda-forge geopy --yes
#!pip install geocoder
#!pip install folium
print('Libraries Installed.')
Libraries Installed.

Importing the required packages.

In [45]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values

import geocoder # to get coordinates
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)

# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

import folium # map rendering library

import wikipedia as wp

from arcgis.geocoding import geocode
from arcgis.gis import GIS
gis = GIS()
print('Libraries imported.')
Libraries imported.

The task is to explore the city and plot a map of the neighborhoods being considered, then build our model by clustering similar neighborhoods together, and finally plot a new map showing the clustered neighborhoods.

1. Web-scrape and Explore Dataset

Exploring London City

Neighborhoods of London

Collecting the data needed for our business solution from Wikipedia.

Data Collection

In [4]:
#Get the html source
html = wp.page("List of areas of London").html().encode("UTF-8")
df = pd.read_html(html, flavor='html5lib')[1]     
df.head()
Out[4]:
Location London borough Post town Postcode district Dial code OS grid ref
0 Abbey Wood Bexley, Greenwich [7] LONDON SE2 020 TQ465785
1 Acton Ealing, Hammersmith and Fulham[8] LONDON W3, W4 020 TQ205805
2 Addington Croydon[8] CROYDON CR0 020 TQ375645
3 Addiscombe Croydon[8] CROYDON CR0 020 TQ345665
4 Albany Park Bexley BEXLEY, SIDCUP DA5, DA14 020 TQ478728

Data Preprocessing

In [5]:
df.rename(columns=lambda x: x.strip().replace(" ", "_"), inplace=True)
df.head()
Out[5]:
Location London borough Post_town Postcode district Dial code OS_grid_ref
0 Abbey Wood Bexley, Greenwich [7] LONDON SE2 020 TQ465785
1 Acton Ealing, Hammersmith and Fulham[8] LONDON W3, W4 020 TQ205805
2 Addington Croydon[8] CROYDON CR0 020 TQ375645
3 Addiscombe Croydon[8] CROYDON CR0 020 TQ345665
4 Albany Park Bexley BEXLEY, SIDCUP DA5, DA14 020 TQ478728

Feature Selection

Keep only the relevant columns (borough, post town and postcode district) for further steps.

In [6]:
df1 = df.drop( [ df.columns[0], df.columns[4], df.columns[5] ], axis=1)
df1.columns = ['borough','town','post_code']
df1['borough'] = df1['borough'].map(lambda x: x.rstrip(']').rstrip('0123456789').rstrip('['))
df1.head()
Out[6]:
borough town post_code
0 Bexley, Greenwich LONDON SE2
1 Ealing, Hammersmith and Fulham LONDON W3, W4
2 Croydon CROYDON CR0
3 Croydon CROYDON CR0
4 Bexley BEXLEY, SIDCUP DA5, DA14
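
The chained rstrip calls above remove the citation marker but not the space in front of it, so a value such as 'Bexley, Greenwich [7]' becomes 'Bexley, Greenwich ' with a trailing space. A regular expression is one way to remove the marker and the surrounding whitespace together. The sketch below is an alternative under that assumption; strip_citation is a hypothetical helper name.

```python
import re

# A trailing Wikipedia-style citation marker such as "[7]" or " [8]",
# together with any whitespace around it
_CITATION = re.compile(r"\s*\[\d+\]\s*$")


def strip_citation(text):
    """Remove one trailing citation marker and any surrounding whitespace."""
    return _CITATION.sub("", text).strip()


print(strip_citation("Bexley, Greenwich [7]"))
print(strip_citation("Croydon[8]"))
print(strip_citation("Bexley"))
```

In the notebook this could be applied with df1['borough'].map(strip_citation) in place of the lambda.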

Dimension of the dataframe

In [7]:
df1.shape
Out[7]:
(531, 3)

We currently have 531 rows and 3 columns. Next, we keep only the rows whose post town contains LONDON.

In [8]:
df1 = df1[df1['town'].str.contains('LONDON')]
df1.head()
Out[8]:
borough town post_code
0 Bexley, Greenwich LONDON SE2
1 Ealing, Hammersmith and Fulham LONDON W3, W4
6 City LONDON EC3
7 Westminster LONDON WC2
9 Bromley LONDON SE20
In [127]:
df1.shape
Out[127]:
(308, 3)

We now have only 308 rows and can proceed with the next steps.

2. Fetch Latitude and Longitude of each Neighborhood

We need the geographical coordinates of the neighborhoods to plot our map. We use the arcgis package for that.

In [9]:
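def get_x_y_uk(address1):
    # Geocode the address with ArcGIS and keep the first match
    g = geocode(address='{}, London, England, GBR'.format(address1))[0]
    # ArcGIS returns x as the longitude and y as the latitude
    lng_coords = g['location']['x']
    lat_coords = g['location']['y']
    # Return the pair as a single "lat,lng" string
    return str(lat_coords) + "," + str(lng_coords)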

Checking geographical co-ordinates

In [10]:
CoordinatesUK = df1['post_code']    
CoordinatesUK.head()
Out[10]:
0       SE2
1    W3, W4
6       EC3
7       WC2
9      SE20
Name: post_code, dtype: object

Passing the postal codes of London to get the geographical coordinates

In [11]:
LatLngUK = CoordinatesUK.apply(lambda x: get_x_y_uk(x))
LatLngUK.head()
Out[11]:
0    51.492450000000076,0.12127000000003818
1     51.51324000000005,-0.2674599999999714
6    51.51200000000006,-0.08057999999994081
7    51.51651000000004,-0.11967999999995982
9    51.41009000000008,-0.05682999999993399
Name: post_code, dtype: object

Latitude

Extracting the latitude from our previously collected coordinates

In [16]:
LatUK = LatLngUK.apply(lambda x: x.split(',')[0])
LatUK.head()
Out[16]:
0    51.492450000000076
1     51.51324000000005
6     51.51200000000006
7     51.51651000000004
9     51.41009000000008
Name: post_code, dtype: object

Longitude

Extracting the Longitude from our previously collected coordinates

In [17]:
LngUK = LatLngUK.apply(lambda x: x.split(',')[1])
LngUK.head()
Out[17]:
0     0.12127000000003818
1     -0.2674599999999714
6    -0.08057999999994081
7    -0.11967999999995982
9    -0.05682999999993399
Name: post_code, dtype: object
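
The two cells above split the same "lat,lng" string twice and leave the parts as text, so the conversion to float happens later. As a small sketch of the same parsing step done in one place, the helper below (parse_lat_lng is a hypothetical name, not part of the notebook) returns both values as floats.

```python
def parse_lat_lng(pair):
    """Split a 'lat,lng' string into a (latitude, longitude) tuple of floats."""
    lat_text, lng_text = pair.split(",")
    return float(lat_text), float(lng_text)


# Example with a value in the same format as the strings returned by get_x_y_uk
lat, lng = parse_lat_lng("51.49245,0.12127")
print(lat, lng)
```

With pandas, the same function could be passed to LatLngUK.apply to obtain tuples, which is one possible alternative to the two separate split cells.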

We now have the geographical co-ordinates of the London Neighborhoods.

We proceed by merging our source data with the geographical coordinates to make our dataset ready for the next stage.

In [18]:
LondonLatLng = pd.concat([df1,LatUK.astype(float), LngUK.astype(float)], axis=1)
LondonLatLng.columns= ['borough','town','post_code','latitude','longitude']
LondonLatLng.head()
Out[18]:
borough town post_code latitude longitude
0 Bexley, Greenwich LONDON SE2 51.49245 0.12127
1 Ealing, Hammersmith and Fulham LONDON W3, W4 51.51324 -0.26746
6 City LONDON EC3 51.51200 -0.08058
7 Westminster LONDON WC2 51.51651 -0.11968
9 Bromley LONDON SE20 51.41009 -0.05683
In [19]:
LondonLatLng.dtypes
Out[19]:
borough       object
town          object
post_code     object
latitude     float64
longitude    float64
dtype: object

Co-ordinates for London

Getting the geocode for London

In [20]:
LondonGeoCodes = geocode(address='London, England, GBR')[0]
LondonLat = LondonGeoCodes['location']['y']
LondonLng = LondonGeoCodes['location']['x']
print(LondonLat)
print(LondonLng)
51.50642000000005
-0.1272099999999341

Results: Visualize the Map of London

In [46]:
# Creating the map of London
LondonMap = folium.Map(location=[LondonLat, LondonLng], zoom_start=12)
LondonMap

# adding markers to map
for latitude, longitude, borough, town in zip(LondonLatLng['latitude'], LondonLatLng['longitude'], LondonLatLng['borough'], LondonLatLng['town']):
    label = '{}, {}'.format(town, borough)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        color='red',
        fill=True
        ).add_to(LondonMap)  
    
LondonMap
Out[46]:
Make this Notebook Trusted to load map: File -> Trust Notebook

3. Explore and Cluster the Neighborhoods in London

To proceed with the next part, we need to define Foursquare API credentials.

Using Foursquare API, we are able to get the venue and venue categories around each Neighborhood in London.

In [22]:
CLIENT_ID = 'YOUR_FOURSQUARE_CLIENT_ID' # your Foursquare ID (placeholder, do not publish real credentials)
CLIENT_SECRET = 'YOUR_FOURSQUARE_CLIENT_SECRET' # your Foursquare Secret (placeholder)
VERSION = '20180605' # Foursquare API version
LIMIT = 100 # A default Foursquare API limit value
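
API credentials typed into a cell end up in every exported copy of the notebook. One common alternative, sketched here under the assumption that the credentials are stored in environment variables (the variable names FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET and the helper name are hypothetical), is to read them at run time.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Return (client_id, client_secret) read from a mapping of environment variables.

    The variable names are an assumption made for this sketch.
    """
    try:
        return env["FOURSQUARE_CLIENT_ID"], env["FOURSQUARE_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of sending an empty key
        raise RuntimeError("Missing environment variable: {}".format(missing)) from None


# Usage in the notebook would look like:
# CLIENT_ID, CLIENT_SECRET = load_foursquare_credentials()
```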
In [23]:
LondonLatLng.head()
Out[23]:
borough town post_code latitude longitude
0 Bexley, Greenwich LONDON SE2 51.49245 0.12127
1 Ealing, Hammersmith and Fulham LONDON W3, W4 51.51324 -0.26746
6 City LONDON EC3 51.51200 -0.08058
7 Westminster LONDON WC2 51.51651 -0.11968
9 Bromley LONDON SE20 51.41009 -0.05683

Defining a function to get the nearby venues in each neighborhood. This gives us the venue categories, which are important for our analysis.

In [24]:
LIMIT=100

def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius,
            LIMIT
            )
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Category']
    
    return(nearby_venues)
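
The function above indexes ['groups'][0] and ['categories'][0] directly, so a response without groups or a venue without categories would raise an error. The sketch below shows one defensive way to flatten a response of the same shape. The helper name extract_venues and the sample payload are made up for illustration; only the nesting (response, groups, items, venue, categories) is taken from the function above.

```python
def extract_venues(payload, name, lat, lng):
    """Flatten one explore-style response into (name, lat, lng, venue, category) rows."""
    groups = payload.get("response", {}).get("groups", [])
    items = groups[0].get("items", []) if groups else []
    rows = []
    for item in items:
        venue = item.get("venue", {})
        categories = venue.get("categories") or []
        # Use None when a venue carries no category instead of raising IndexError
        category = categories[0]["name"] if categories else None
        rows.append((name, lat, lng, venue.get("name"), category))
    return rows


# Hand-written sample payload (illustrative only, not real API output)
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Lesnes Abbey", "categories": [{"name": "Historic Site"}]}},
    {"venue": {"name": "Unnamed Kiosk", "categories": []}},
]}]}}

rows = extract_venues(sample, "Bexley, Greenwich", 51.49245, 0.12127)
for row in rows:
    print(row)
```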

Collect the venues in London

In [25]:
# save the neighborhood venue data and reuse it, because of the Foursquare per-day request limit
VenuesDf = getNearbyVenues(LondonLatLng['borough'], LondonLatLng['latitude'], LondonLatLng['longitude'])
VenuesDf.head()
Out[25]:
Neighborhood Latitude Longitude Venue Venue Category
0 Bexley, Greenwich 51.49245 0.12127 Lesnes Abbey Historic Site
1 Bexley, Greenwich 51.49245 0.12127 Sainsbury's Supermarket
2 Bexley, Greenwich 51.49245 0.12127 Lidl Supermarket
3 Bexley, Greenwich 51.49245 0.12127 Abbey Wood Railway Station (ABW) Train Station
4 Bexley, Greenwich 51.49245 0.12127 Bean @ Work Coffee Shop
In [26]:
# saving the London Venues dataframe 
#VenuesDf.to_csv('LondonVenues.csv', index=False)
#VenuesDf = VenuesDf.drop(VenuesDf.columns[[0]], axis=1)
#VenuesDf = pd.read_csv("LondonVenues.csv") 
#VenuesDf.head()
In [27]:
VenuesDf.columns = ['Neighborhood', 'Latitude', 'Longitude', 'VenueName', 'VenueCategory']
VenuesDf.sort_values(["Neighborhood"], inplace=True, ascending=True)
In [28]:
VenuesDf.shape
Out[28]:
(10354, 5)

10354 records for venues.

Grouping by Venue Categories

Unique Venue Categories

In [29]:
VenuesDf.groupby(["Neighborhood"]).count()
print('There are {} uniques categories.'.format(len(VenuesDf['VenueCategory'].unique())))
There are 298 uniques categories.

There are 298 unique venue categories.

4. Analyze Each Neighborhood

One Hot Encoding

We need to encode the venue categories as numeric columns before they can be used for clustering.

In [30]:
# one hot encoding
onehot = pd.get_dummies(VenuesDf[['VenueCategory']], prefix="", prefix_sep="")

# add neighborhood column back to dataframe
onehot['Neighborhoods'] = VenuesDf['Neighborhood'] 

# move neighborhood column to the first column
fixed_columns = [onehot.columns[-1]] + list(onehot.columns[:-1])
onehot = onehot[fixed_columns]

print(onehot.shape)

grouped = onehot.groupby(["Neighborhoods"]).mean().reset_index()

print(grouped.shape)
grouped.head()
(10354, 299)
(50, 299)
Out[30]:
Neighborhoods Accessories Store Adult Boutique African Restaurant American Restaurant Antique Shop Arcade Arepa Restaurant Argentinian Restaurant Art Gallery Art Museum Arts & Crafts Store Asian Restaurant Athletics & Sports Australian Restaurant Auto Garage Auto Workshop BBQ Joint Bagel Shop Bakery Bar Beach Bed & Breakfast Beer Bar Beer Garden Beer Store Bike Rental / Bike Share Bike Shop Bistro Boarding House Bookstore Botanical Garden Boutique Bowling Alley Boxing Gym Brasserie Brazilian Restaurant Breakfast Spot Brewery Bridal Shop Building Burger Joint Burrito Place Bus Station Bus Stop Butcher Café Camera Store Canal Candy Store Caribbean Restaurant Caucasian Restaurant Cemetery Champagne Bar Cheese Shop Child Care Service Chinese Restaurant Chocolate Shop Church Circus School Climbing Gym Clothing Store Cocktail Bar Coffee Shop College Quad Comedy Club Comfort Food Restaurant Community Center Concert Hall Construction & Landscaping Convenience Store Convention Center Cosmetics Shop Costume Shop Creperie Cricket Ground Cupcake Shop Cycle Studio Dance Studio Deli / Bodega Department Store Dessert Shop Dim Sum Restaurant Diner Discount Store Distillery Dog Run Doner Restaurant Donut Shop Dry Cleaner Eastern European Restaurant Electronics Store English Restaurant Escape Room Ethiopian Restaurant Event Space Exhibit Falafel Restaurant Farmers Market Fast Food Restaurant Filipino Restaurant Film Studio Fish & Chips Shop Fish Market Flea Market Flower Shop Food & Drink Shop Food Court Food Service Food Stand Food Truck Fountain French Restaurant Fried Chicken Joint Fruit & Vegetable Store Furniture / Home Store Gaming Cafe Garden Garden Center Gas Station Gastropub Gay Bar Gelato Shop General Entertainment German Restaurant Gift Shop Golf Course Gourmet Shop Greek Restaurant Grilled Meat Restaurant Grocery Store Gym Gym / Fitness Center Gym Pool Halal Restaurant Hardware Store Health & Beauty Service Health Food Store Herbs & Spices Store Historic Site History 
Museum Home Service Hookah Bar Hostel Hotel Hotel Bar Hungarian Restaurant Ice Cream Shop Indian Restaurant Indie Movie Theater Indie Theater Indoor Play Area Irish Pub Italian Restaurant Japanese Restaurant Jazz Club Jewelry Store Juice Bar Kebab Restaurant Kids Store Kitchen Supply Store Korean Restaurant Kosher Restaurant Lake Latin American Restaurant Lebanese Restaurant Library Light Rail Station Lingerie Store Liquor Store Locksmith Lounge Malay Restaurant Market Martial Arts School Massage Studio Mediterranean Restaurant Men's Store Metro Station Mexican Restaurant Middle Eastern Restaurant Mini Golf Miscellaneous Shop Mobile Phone Shop Modern European Restaurant Monument / Landmark Moroccan Restaurant Motorcycle Shop Movie Theater Multiplex Museum Music Store Music Venue Nail Salon Nature Preserve Nightclub Noodle House North Indian Restaurant Office Okonomiyaki Restaurant Opera House Optical Shop Organic Grocery Outdoor Sculpture Pakistani Restaurant Paper / Office Supplies Store Park Pedestrian Plaza Performing Arts Venue Persian Restaurant Peruvian Restaurant Pet Store Pharmacy Pie Shop Pier Pilates Studio Pizza Place Platform Playground Plaza Poke Place Polish Restaurant Pool Portuguese Restaurant Post Office Pub Ramen Restaurant Record Shop Recording Studio Recreation Center Rental Car Location Reservoir Residential Building (Apartment / Condo) Restaurant Sake Bar Salad Place Sandwich Place Scandinavian Restaurant Scenic Lookout School Science Museum Sculpture Garden Seafood Restaurant Shaanxi Restaurant Shoe Store Shop & Service Shopping Mall Shopping Plaza Skate Park Snack Place Soccer Field Soccer Stadium Social Club Soup Place South American Restaurant South Indian Restaurant Souvenir Shop Spa Spanish Restaurant Speakeasy Sporting Goods Shop Sports Bar Stables Stationery Store Steakhouse Street Food Gathering Student Center Supermarket Sushi Restaurant Szechuan Restaurant Taco Place Tapas Restaurant Tea Room Tennis Court Thai Restaurant Theater 
Thrift / Vintage Store Tour Provider Tourist Information Center Toy / Game Store Track Trail Train Station Tunnel Turkish Restaurant University Vape Store Vegetarian / Vegan Restaurant Video Game Store Vietnamese Restaurant Warehouse Store Whisky Bar Wine Bar Wine Shop Wings Joint Women's Store Xinjiang Restaurant Yoga Studio Zoo Exhibit
0 Barnet 0.0 0.0 0.0 0.001821 0.0 0.0 0.0 0.007286 0.0 0.0 0.0 0.020036 0.0 0.0 0.0 0.007286 0.001821 0.01275 0.014572 0.005464 0.0 0.0 0.005464 0.0 0.0 0.0 0.0 0.005464 0.0 0.005464 0.0 0.0 0.0 0.0 0.0 0.0 0.005464 0.0 0.0 0.0 0.0 0.0 0.0 0.043716 0.0 0.067395 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.001821 0.0 0.025501 0.0 0.0 0.0 0.0 0.001821 0.0 0.109290 0.0 0.0 0.0 0.0 0.0 0.000000 0.003643 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.001821 0.007286 0.0 0.001821 0.0 0.001821 0.005464 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.003643 0.0 0.0 0.0 0.0 0.001821 0.0 0.018215 0.0 0.0 0.005464 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.005464 0.005464 0.0 0.005464 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.005464 0.000000 0.0 0.007286 0.0 0.061931 0.0 0.018215 0.0 0.0 0.009107 0.003643 0.0 0.0 0.000000 0.0 0.0 0.009107 0.0 0.018215 0.0 0.0 0.007286 0.016393 0.0 0.0 0.0 0.005464 0.034608 0.021858 0.0 0.0 0.009107 0.0 0.0 0.0 0.005464 0.001821 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.009107 0.0 0.010929 0.0 0.001821 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.005464 0.0 0.0 0.001821 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.001821 0.0 0.0 0.0 0.016393 0.0 0.0 0.0 0.0 0.0 0.030965 0.0 0.0 0.0 0.025501 0.014572 0.0 0.001821 0.0 0.0 0.0 0.007286 0.0 0.041894 0.0 0.003643 0.0 0.0 0.0 0.0 0.0 0.005464 0.0 0.0 0.020036 0.0 0.0 0.0 0.0 0.0 0.007286 0.0 0.003643 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.001821 0.0 0.0 0.0 0.0 0.009107 0.001821 0.0 0.005464 0.034608 0.025501 0.0 0.0 0.001821 0.005464 0.0 0.005464 0.005464 0.0 0.0 0.0 0.001821 0.0 0.0 0.016393 0.0 0.027322 0.0 0.0 0.001821 0.0 0.001821 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Barnet, Brent, Camden 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.200000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.200000 0.0 0.0 0.200000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.200000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.200000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Bexley 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.038462 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.115385 0.0 0.0 0.0 0.0 0.0 0.038462 0.115385 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.038462 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.153846 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.038462 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.115385 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.230769 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.115385 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Bexley, Greenwich 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.166667 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.166667 0.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.166667 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.166667 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.166667 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Bexley, Greenwich 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.00000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.142857 0.0 0.0 0.0 0.0 0.0 0.000000 0.142857 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.142857 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.000000 0.142857 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.285714 0.000000 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.0 0.000000 0.0 0.0 0.142857 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
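
Taking the mean of one-hot columns per neighborhood yields, for each category, the share of that neighborhood's venues that fall in the category. For example, a value of 0.200000 means that one in five venues returned for that neighborhood belongs to the category. The sketch below reproduces the same calculation in plain Python on made-up toy rows, so the arithmetic can be checked by hand; category_frequencies is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict


def category_frequencies(rows):
    """Map each neighborhood to {category: share of its venues in that category}."""
    counts = defaultdict(Counter)
    for neighborhood, category in rows:
        counts[neighborhood][category] += 1
    result = {}
    for neighborhood, counter in counts.items():
        total = sum(counter.values())
        result[neighborhood] = {cat: n / total for cat, n in counter.items()}
    return result


# Toy data (illustrative only): neighborhood "A" has 4 venues, "B" has 1
toy = [
    ("A", "Pub"),
    ("A", "Pub"),
    ("A", "Japanese Restaurant"),
    ("A", "Park"),
    ("B", "Pub"),
]
freq = category_frequencies(toy)
print(freq["A"]["Japanese Restaurant"])
```

Categories that never occur in a neighborhood are simply absent from its dictionary here, whereas the one-hot dataframe stores them as 0.0.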

Analysis of Top Restaurant Venues Category in London

In [31]:
CategoryDf = pd.DataFrame(VenuesDf['VenueCategory'].value_counts()).reset_index()
CategoryDf.columns = ['VenueCategory','Total']
CategoryDf = CategoryDf[CategoryDf['VenueCategory'].str.contains('Restaurant')]
CategoryDf.head()
Out[31]:
VenueCategory Total
4 Italian Restaurant 316
12 Indian Restaurant 167
14 Restaurant 144
17 Fast Food Restaurant 123
21 Chinese Restaurant 110

Bar Chart for Venue Category wise Count

In [32]:
import plotly.express as px
topvenues_barchart = px.bar(CategoryDf.query("Total>48"),
                            x="VenueCategory",
                            y="Total", 
                            color="VenueCategory")

topvenues_barchart.update_layout(title = 'London Venue Category with venues',
                         margin={"r":0,"t":30,"l":0,"b":0})

topvenues_barchart.update_xaxes(showticklabels=False) # Removed tick labels as it was too long
topvenues_barchart.show() # Display plot
In [34]:
# len(grouped[grouped["Japanese Restaurant"] > 0])
LondonRest = grouped[["Neighborhoods","Japanese Restaurant"]]
LondonRest
Out[34]:
Neighborhoods Japanese Restaurant
0 Barnet 0.021858
1 Barnet, Brent, Camden 0.000000
2 Bexley 0.000000
3 Bexley, Greenwich 0.000000
4 Bexley, Greenwich 0.000000
5 Brent 0.000000
6 Brent, Camden 0.000000
7 Brent, Ealing 0.000000
8 Brent, Harrow 0.027778
9 Bromley 0.000000
10 Camden 0.019656
11 Camden, Islington 0.000000
12 City 0.012931
13 City, Westminster 0.010000
14 Croydon 0.013889
15 Ealing 0.000000
16 Ealing, Hammersmith and Fulham 0.000000
17 Enfield 0.000000
18 Greenwich 0.012821
19 Greenwich, Lewisham 0.000000
20 Hackney 0.001656
21 Hammersmith and Fulham 0.012690
22 Haringey 0.024691
23 Haringey, Barnet 0.044444
24 Haringey, Islington 0.000000
25 Harrow, Brent 0.000000
26 Hounslow 0.000000
27 Hounslow, Ealing, Hammersmith and Fulham 0.000000
28 Islington 0.000000
29 Islington, Camden 0.000000
30 Islington, City 0.000000
31 Kensington and Chelsea 0.008757
32 Kensington and Chelsea, Hammersmith and Fulham 0.052083
33 Kingston upon Thames 0.035714
34 Lambeth 0.000000
35 Lambeth, Southwark 0.000000
36 Lambeth, Wandsworth 0.012987
37 Lewisham 0.002532
38 Lewisham, Bromley 0.000000
39 Lewisham, Southwark 0.000000
40 Merton 0.007042
41 Newham 0.000000
42 Redbridge 0.000000
43 Redbridge, Waltham Forest 0.019802
44 Richmond upon Thames 0.000000
45 Southwark 0.000000
46 Tower Hamlets 0.006438
47 Waltham Forest 0.000000
48 Wandsworth 0.005405
49 Westminster 0.008371

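The table has 50 rows, one per borough or borough combination. Note that 'Bexley, Greenwich' appears twice (rows 3 and 4); the two labels differ only in trailing whitespace, which is cleaned up in the data wrangling step further on.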
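
Rows 3 and 4 of the table above carry what looks like the same label, 'Bexley, Greenwich'. Labels that differ only in surrounding whitespace are easy to miss in printed output, so a quick check can help. The sketch below is one way to list such variants; whitespace_variants is a hypothetical helper and the name list is a shortened example.

```python
from collections import defaultdict


def whitespace_variants(names):
    """Return {stripped name: sorted variants} for names that differ only by outer whitespace."""
    groups = defaultdict(set)
    for name in names:
        groups[name.strip()].add(name)
    return {clean: sorted(variants)
            for clean, variants in groups.items()
            if len(variants) > 1}


# Shortened example list, with one label carrying a trailing space
names = ["Bexley", "Bexley, Greenwich", "Bexley, Greenwich ", "Brent"]
print(whitespace_variants(names))
```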

5. Cluster Neighborhoods

In [35]:
kclusters = 4

LondonClustering = LondonRest.drop(["Neighborhoods"], axis=1)  # axis must be passed by keyword in pandas 2.x
#LondonClustering.head()

# run k-means clustering
kmeans = KMeans(init="k-means++", n_clusters=kclusters, n_init=12).fit(LondonClustering)

# check cluster labels generated for each row in the dataframe
kmeans.labels_[0:10]
Out[35]:
array([2, 0, 0, 0, 0, 0, 0, 0, 2, 0], dtype=int32)
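
Because the clustering input is a single column, K-means here amounts to splitting the Japanese restaurant shares into ranges of similar values. The sketch below illustrates the assign-and-update loop (Lloyd's algorithm) on plain numbers. It is a teaching illustration only, not scikit-learn's implementation: it starts from evenly spaced centroids rather than k-means++, and the function name and sample values are made up.

```python
def kmeans_1d(values, k, max_iter=100):
    """Cluster scalars with Lloyd's algorithm, starting from evenly spaced centroids."""
    lo, hi = min(values), max(values)
    if k == 1:
        centroids = [(lo + hi) / 2]
    else:
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest centroid
        new_labels = [min(range(k), key=lambda j: abs(v - centroids[j]))
                      for v in values]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are final
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


# Made-up shares: three low values and two high values
labels, centroids = kmeans_1d([0.0, 0.0, 0.01, 0.05, 0.052], k=2)
print(labels)
```

Note that the numeric cluster labels are arbitrary identifiers. With scikit-learn, passing random_state to KMeans is one way to make the labels reproducible between runs.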
In [36]:
Merged = LondonRest.copy()

# add clustering labels
Merged["Category"] = kmeans.labels_
Merged.rename(columns={"Neighborhoods": "Neighborhood"}, inplace=True)
Merged.sort_values(["Neighborhood"], inplace=True, ascending=False)
Merged.head()
Out[36]:
Neighborhood Japanese Restaurant Category
49 Westminster 0.008371 3
48 Wandsworth 0.005405 3
47 Waltham Forest 0.000000 0
46 Tower Hamlets 0.006438 3
45 Southwark 0.000000 0

Neighbourhood Data Wrangling

In [37]:
Londondf = VenuesDf[VenuesDf.columns[0]] 
Londondf = pd.DataFrame(Londondf).reset_index()
# Eliminate the duplicate value caused by a trailing space
Londondf = Londondf.replace('Bexley, Greenwich ', 'Bexley, Greenwich', regex=True)
Londondf['ChrLen'] = Londondf['Neighborhood'].str.len()
Londondf = Londondf.drop_duplicates(subset=['Neighborhood'], keep='first')
Londondf.sort_values(["Neighborhood"], inplace=True, ascending=True)
Londondf = pd.DataFrame(Londondf).reset_index() 

#Remove unwanted columns and indexes 
Londondf = Londondf.drop(Londondf.columns[[0]], axis=1)
Londondf = Londondf.drop(Londondf.columns[[0]], axis=1)
Londondf = Londondf.drop(Londondf.columns[[1]], axis=1)
Londondf.head()
Out[37]:
Neighborhood
0 Barnet
1 Barnet, Brent, Camden
2 Bexley
3 Bexley, Greenwich
4 Brent
In [38]:
# Geographical coordinates of neighborhoods
textList = []
neighborhoodList = []
# define a function to get coordinates
def get_latlng(neighborhood):
    # initialize your variable to None
    lat_lng_coords = None
    # loop until you get the coordinates
    while(lat_lng_coords is None):
        g = geocoder.arcgis('{}, London, England, GBR'.format(neighborhood))
        lat_lng_coords = g.latlng
    return lat_lng_coords

coords = [ get_latlng(neighborhood) for neighborhood in Londondf["Neighborhood"].tolist() ]

CoordsDf = pd.DataFrame(coords, columns=['Latitude', 'Longitude'])

# merge the coordinates into the original dataframe
Londondf['Latitude'] = CoordsDf['Latitude']
Londondf['Longitude'] = CoordsDf['Longitude']

# check the neighborhoods and the coordinates
print(Londondf.shape)
Londondf.head()
(49, 3)
Out[38]:
Neighborhood Latitude Longitude
0 Barnet 51.527095 -0.066826
1 Barnet, Brent, Camden 51.532360 -0.127960
2 Bexley 51.452078 0.069931
3 Bexley, Greenwich 51.476850 0.019210
4 Brent 51.609783 -0.194672

Merging the cluster labels with the neighborhood coordinates.

In [39]:
MergedDf = Merged.merge(Londondf)
MergedDf.head()

#Sort
MergedDf.sort_values(["Category"], inplace=True, ascending=False)
MergedDf
Out[39]:
Neighborhood Japanese Restaurant Category Latitude Longitude
0 Westminster 0.008371 3 51.500100 -0.128030
18 Kensington and Chelsea 0.008757 3 51.522660 -0.207930
3 Tower Hamlets 0.006438 3 51.520220 -0.054310
37 City 0.012931 3 51.519630 -0.163110
36 City, Westminster 0.010000 3 51.497400 -0.137350
35 Croydon 0.013889 3 51.593480 -0.083420
31 Greenwich 0.012821 3 51.484540 0.002750
9 Merton 0.007042 3 51.415640 -0.191420
28 Hammersmith and Fulham 0.012690 3 51.482600 -0.212880
1 Wandsworth 0.005405 3 51.456820 -0.194520
13 Lambeth, Wandsworth 0.012987 3 51.456820 -0.194520
41 Brent, Harrow 0.027778 2 51.513180 -0.106980
39 Camden 0.019656 2 51.532360 -0.127960
27 Haringey 0.024691 2 51.589270 -0.106405
48 Barnet 0.021858 2 51.527095 -0.066826
6 Redbridge, Waltham Forest 0.019802 2 51.587770 -0.027970
16 Kingston upon Thames 0.035714 1 51.413560 -0.305660
17 Kensington and Chelsea, Hammersmith and Fulham 0.052083 1 51.496030 -0.207980
26 Haringey, Barnet 0.044444 1 51.580810 -0.093760
40 Bromley 0.000000 0 51.601511 -0.066365
7 Redbridge 0.000000 0 51.475773 -0.080698
5 Richmond upon Thames 0.000000 0 51.480270 -0.237540
38 Camden, Islington 0.000000 0 51.532790 -0.106140
4 Southwark 0.000000 0 51.505410 -0.089190
15 Lambeth 0.000000 0 51.490840 -0.111080
34 Ealing 0.000000 0 51.514060 -0.300730
42 Brent, Ealing 0.000000 0 51.514060 -0.300730
43 Brent, Camden 0.000000 0 51.532360 -0.127960
44 Brent 0.000000 0 51.609783 -0.194672
45 Bexley, Greenwich 0.000000 0 51.476850 0.019210
46 Bexley 0.000000 0 51.452078 0.069931
47 Barnet, Brent, Camden 0.000000 0 51.532360 -0.127960
2 Waltham Forest 0.000000 0 51.630613 -0.016275
32 Enfield 0.000000 0 51.540024 -0.077502
33 Ealing, Hammersmith and Fulham 0.000000 0 51.482600 -0.212880
8 Newham 0.000000 0 51.519716 0.051479
30 Greenwich, Lewisham 0.000000 0 51.448255 -0.002360
29 Hackney 0.001656 0 51.545050 -0.055320
10 Lewisham, Southwark 0.000000 0 51.505410 -0.089190
11 Lewisham, Bromley 0.000000 0 51.459160 -0.012130
25 Haringey, Islington 0.000000 0 51.532790 -0.106140
12 Lewisham 0.002532 0 51.459160 -0.012130
23 Hounslow 0.000000 0 51.447300 -0.408880
22 Hounslow, Ealing, Hammersmith and Fulham 0.000000 0 51.477720 -0.201450
21 Islington 0.000000 0 51.532790 -0.106140
20 Islington, Camden 0.000000 0 51.532890 -0.137479
19 Islington, City 0.000000 0 51.531340 -0.102770
14 Lambeth, Southwark 0.000000 0 51.494471 -0.120066
24 Harrow, Brent 0.000000 0 51.513180 -0.106980
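
By default, `DataFrame.merge` performs an inner join, so a row whose `Neighborhood` value has no exact match in the other dataframe is dropped silently. The four cluster tables in this notebook list 50 rows in total while the merged table above has 49, so one row appears to have been lost this way. The sketch below shows one way to surface such rows with a left join and `indicator=True`. The dataframes `clusters` and `coords_df` are small hypothetical stand-ins for `Merged` and `Londondf`.

```python
import pandas as pd

# Hypothetical stand-ins for Merged and Londondf
clusters = pd.DataFrame({
    "Neighborhood": ["Camden", "Brent", "Hackney"],
    "Category": [2, 0, 0],
})
coords_df = pd.DataFrame({
    "Neighborhood": ["Camden", "Brent"],
    "Latitude": [51.532360, 51.609783],
    "Longitude": [-0.127960, -0.194672],
})

# A left join keeps every cluster row; the _merge column records where each row matched
checked = clusters.merge(coords_df, on="Neighborhood", how="left", indicator=True)

# Rows present only on the left side have no coordinates
missing = checked.loc[checked["_merge"] == "left_only", "Neighborhood"].tolist()
print(missing)
```

A common cause of such mismatches is stray whitespace in the key column, which `str.strip()` on `Neighborhood` would remove before merging.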

Results: Visualising Clusters

In [47]:
# Creating the map of London
ClustersMap = folium.Map(location=[LondonLat, LondonLng], zoom_start=12)

# set color scheme for the clusters: one hex color per cluster label
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add one circle marker per neighborhood, colored by its cluster
for latitude, longitude, poi, cluster in zip(MergedDf['Latitude'], MergedDf['Longitude'], MergedDf['Neighborhood'], MergedDf['Category']):
    label = folium.Popup(str(poi) + ' - Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [latitude, longitude],
        radius=5,
        popup=label,
        # labels run from 0 to kclusters-1, so they index the palette directly
        color=rainbow[cluster],
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(ClustersMap)

# display the finished map
ClustersMap
Out[47]:
[Interactive folium map of London: one circle marker per neighborhood, colored by cluster]
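
The color scheme in the cell above amounts to sampling `kclusters` evenly spaced colors from matplotlib's `rainbow` colormap and converting them to hex strings. The sketch below isolates that step so the label-to-color mapping can be inspected without drawing a map. The helper name `cluster_palette` is hypothetical, and 4 clusters are assumed, matching the four categories in this notebook.

```python
import numpy as np
from matplotlib import cm, colors

def cluster_palette(n_clusters):
    """Return one hex color string per cluster label 0..n_clusters-1."""
    samples = cm.rainbow(np.linspace(0, 1, n_clusters))  # one RGBA row per cluster
    return [colors.rgb2hex(rgba) for rgba in samples]

palette = cluster_palette(4)

# Map each cluster label to its color, e.g. to build a legend
label_to_color = dict(enumerate(palette))
print(label_to_color)
```

Printing the mapping next to the map is a simple substitute for a legend, since folium circle markers do not produce one on their own.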

6. Examine Clusters

Category label 0: Neighborhoods with almost no Japanese restaurants (`Japanese Restaurant` values from 0 to about 0.003)

In [41]:
Merged.loc[Merged['Category'] == 0]
Out[41]:
Neighborhood Japanese Restaurant Category
47 Waltham Forest 0.000000 0
45 Southwark 0.000000 0
44 Richmond upon Thames 0.000000 0
42 Redbridge 0.000000 0
41 Newham 0.000000 0
39 Lewisham, Southwark 0.000000 0
38 Lewisham, Bromley 0.000000 0
37 Lewisham 0.002532 0
35 Lambeth, Southwark 0.000000 0
34 Lambeth 0.000000 0
30 Islington, City 0.000000 0
29 Islington, Camden 0.000000 0
28 Islington 0.000000 0
27 Hounslow, Ealing, Hammersmith and Fulham 0.000000 0
26 Hounslow 0.000000 0
25 Harrow, Brent 0.000000 0
24 Haringey, Islington 0.000000 0
20 Hackney 0.001656 0
19 Greenwich, Lewisham 0.000000 0
17 Enfield 0.000000 0
16 Ealing, Hammersmith and Fulham 0.000000 0
15 Ealing 0.000000 0
11 Camden, Islington 0.000000 0
9 Bromley 0.000000 0
7 Brent, Ealing 0.000000 0
6 Brent, Camden 0.000000 0
5 Brent 0.000000 0
4 Bexley, Greenwich 0.000000 0
3 Bexley, Greenwich 0.000000 0
2 Bexley 0.000000 0
1 Barnet, Brent, Camden 0.000000 0

Category label 1: Neighborhoods with the highest `Japanese Restaurant` values (about 0.036 to 0.052)

In [42]:
Merged.loc[Merged['Category'] == 1]
Out[42]:
Neighborhood Japanese Restaurant Category
33 Kingston upon Thames 0.035714 1
32 Kensington and Chelsea, Hammersmith and Fulham 0.052083 1
23 Haringey, Barnet 0.044444 1

Category label 2: Neighborhoods with moderate `Japanese Restaurant` values (about 0.020 to 0.028)

In [43]:
Merged.loc[Merged['Category'] == 2]
Out[43]:
Neighborhood Japanese Restaurant Category
43 Redbridge, Waltham Forest 0.019802 2
22 Haringey 0.024691 2
10 Camden 0.019656 2
8 Brent, Harrow 0.027778 2
0 Barnet 0.021858 2

Category label 3: Neighborhoods with low `Japanese Restaurant` values (about 0.005 to 0.014)

In [44]:
Merged.loc[Merged['Category'] == 3]
Out[44]:
Neighborhood Japanese Restaurant Category
49 Westminster 0.008371 3
48 Wandsworth 0.005405 3
46 Tower Hamlets 0.006438 3
40 Merton 0.007042 3
36 Lambeth, Wandsworth 0.012987 3
31 Kensington and Chelsea 0.008757 3
21 Hammersmith and Fulham 0.012690 3
18 Greenwich 0.012821 3
14 Croydon 0.013889 3
13 City, Westminster 0.010000 3
12 City 0.012931 3
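
K-means assigns its labels in no particular order, so the label number alone does not say which cluster has more Japanese restaurants. One way to rank the clusters is to summarize the `Japanese Restaurant` column per label, as sketched below. The dataframe `sample` is a hypothetical excerpt; its values are copied from a few rows of the four tables above, and the same `groupby` could be applied to the full `Merged` dataframe.

```python
import pandas as pd

# Hypothetical excerpt of Merged; values copied from the cluster tables above
sample = pd.DataFrame({
    "Neighborhood": ["Kingston upon Thames", "Haringey, Barnet", "Camden", "Haringey",
                     "Westminster", "City", "Lewisham", "Brent"],
    "Japanese Restaurant": [0.035714, 0.044444, 0.019656, 0.024691,
                            0.008371, 0.012931, 0.002532, 0.000000],
    "Category": [1, 1, 2, 2, 3, 3, 0, 0],
})

# Per-cluster size and range of the clustered value, highest mean first
summary = (
    sample.groupby("Category")["Japanese Restaurant"]
    .agg(["count", "min", "mean", "max"])
    .sort_values("mean", ascending=False)
)

ranked_labels = summary.index.tolist()
print(summary)
```

Reading the clusters in the order of `ranked_labels` rather than by label number keeps the interpretation tied to the data.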

7. Discussion & Conclusion

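Reading the tables above by the `Category` label used in the code, label 1 has the highest `Japanese Restaurant` values (about 0.036 to 0.052), label 2 is moderate (about 0.020 to 0.028), label 3 is low (about 0.005 to 0.014), and label 0 is close to zero. K-means labels are arbitrary, so the label order itself does not rank the clusters. On these values, Category 2 looks like the best bet for opening a new restaurant: competition is not too heavy, but the market is proven. Clients with USPs that stand out from the competition could also consider the neighborhoods in Category 1, where the concentration of Japanese restaurants, and hence the competition, is highest.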

Thank You